[Unity][Transform] Memory planning for dynamic-shape func return #16111
Conversation
I was not 100% sure whether it is desirable to set the output buffer to its size upper bound. By default, a Relax function can be called as many times as the caller wants, and if the output of a Relax function is never deallocated, multiple copies of upper-bound-sized memory will be retained for an indeterminate life span.
@junrushao Just want to make sure I understand correctly: are you saying that when the function is called multiple times, there will be multiple allocations of the output buffer? Yes, there will indeed be multiple allocations if we do not use a pool allocator or never release allocated buffers. But I feel this behavior already exists when we do not allocate the upper-bound size for the output tensor. The main purpose of this PR is to make the output tensor buffer size purely static, so that we can avoid memory fragmentation at runtime when using a pool allocator.
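The fragmentation point can be illustrated with a toy pool allocator (a sketch written purely for illustration; `PoolAllocator` and its behavior are simplified assumptions, not TVM's actual runtime allocator): a size-keyed pool reuses a buffer only when the exact request size repeats, so varying dynamic sizes each create a new pool entry, while a fixed upper-bound size is served by a single reused buffer.

```python
# Illustrative sketch of exact-size vs. upper-bound allocation requests
# against a pool allocator keyed by request size (an assumption for this
# sketch, not TVM's actual allocator).

class PoolAllocator:
    def __init__(self):
        self.free = {}           # size -> list of released buffers
        self.total_allocated = 0

    def alloc(self, size):
        bucket = self.free.setdefault(size, [])
        if bucket:
            return bucket.pop()  # reuse only on an exact size match
        self.total_allocated += size
        return bytearray(size)

    def release(self, buf):
        self.free.setdefault(len(buf), []).append(buf)

def run_exact(alloc, sizes):
    for s in sizes:
        buf = alloc.alloc(s)     # request size varies per call
        alloc.release(buf)

def run_upper_bound(alloc, sizes, upper):
    for s in sizes:
        assert s <= upper
        buf = alloc.alloc(upper)  # request size is always the static upper bound
        alloc.release(buf)

sizes = [100, 700, 300, 1000, 50]

a = PoolAllocator()
run_exact(a, sizes)
print(a.total_allocated)   # 2150: every distinct dynamic size allocated once

b = PoolAllocator()
run_upper_bound(b, sizes, upper=1000)
print(b.total_allocated)   # 1000: a single upper-bound storage is reused
```

The sketch shows why a purely static request size plays well with a simple pool allocator, at the cost of the over-allocation discussed above.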
I was thinking about the following case: allocating a tensor of shape … As compiler infrastructure, we cannot assume runtime use cases when no specific hints/annotations are given, which, in our particular case, means we cannot assume the life span of the returned value when always allocating return tensors at their upper-bound memory usage. Because it cannot assume runtime use cases, this behavior could be suboptimal if the caller of the Relax function indefinitely extends the life span of the returned value, for example by keeping multiple small …
My primary concern is over-allocated memory being kept for an indefinite life span, not merely a function being called multiple times.
Yeah, this definitely helps reduce fragmentation, but it also leads to unnecessary memory over-allocation and potential waste, so there's definitely a trade-off here.
@junrushao Yes, there is some amount of waste in storage size for sure. As long as the upper bound is as tight as possible, I feel the “over-allocation” is not going to be a severe issue. As for the number of allocations, the change in this PR does not increase the number of allocations compared with before. (If we don't use a pool allocator, the number of allocations remains the same; if we do use a pool allocator, only one allocation happens.) For the batching example, say we can analyze the maximum possible batch size ahead of time and annotate that value as the upper bound. In a serving engine, every integer between …
I would love to emphasize that, as a generic compiler infrastructure, we usually do not assume a single use case or depend on runtime behavior; for example, GCC cannot assume AVX-512 instructions always exist if not explicitly told.
There's definitely a difference between personal feelings and objective factors in design choices, and I'd love to discuss the objective factors specifically. Let me write a bit more about the example I drew previously in the thread. Consider the case where we are calling `mod["main"]` in a loop:

```cpp
std::vector<NDArray> outputs;
for (int i = 0; i < 1024; ++i) {
  NDArray result = mod["main"](...);  // size = 1k, but storage = 128k
  outputs.push_back(result);
}
this->outputs = outputs;
```

It effectively means `outputs` takes 1024 × 128k of storage while only 1024 × 1k of it is actually used.
As an alternative, I'd love to suggest that upper-bound allocation is not the only solution to de-fragmentation; there are indeed well-practiced solutions, for example bucketing, which creates a bucket …
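Bucketing can be sketched in a few lines (a toy illustration of the general technique, not a TVM component; all names here are mine): each request is rounded up to a fixed size class, e.g. the next power of two, so arbitrary dynamic sizes collapse into a small set of reusable buffer sizes.

```python
# Toy bucketing allocator: round each request up to the next power of two,
# so dynamic sizes collapse into a few reusable size classes. Waste per
# buffer is bounded by ~2x rather than by a global upper bound.

def bucket_size(n):
    """Smallest power of two >= n (the size class for a request of n bytes)."""
    size = 1
    while size < n:
        size *= 2
    return size

class BucketAllocator:
    def __init__(self):
        self.free = {}  # size class -> list of free buffers

    def alloc(self, n):
        size = bucket_size(n)
        bucket = self.free.setdefault(size, [])
        return bucket.pop() if bucket else bytearray(size)

    def release(self, buf):
        self.free.setdefault(len(buf), []).append(buf)

alloc = BucketAllocator()
for n in (900, 1000, 513, 600):   # all fall into the 1024-byte size class
    buf = alloc.alloc(n)
    alloc.release(buf)

# Only one 1024-byte buffer was ever created and reused for all four sizes.
print(sum(len(b) for bs in alloc.free.values() for b in bs))  # 1024
```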
@junrushao Yes, I agree with you in general that a compiler is not supposed to make assumptions about the runtime. In terms of the current memory management in TVM Unity, the VM makes heavy use of the pool allocator, which, as you pointed out, is not clever enough. I would be more than happy to follow up in the future to make the pool allocator more intelligent about allocation management, so that we can reach a balance between fragmentation and over-allocation. The main purpose of this PR is, still, to manage memory fragmentation, which has proven to be an existing issue that can be severe when memory usage gets close to the memory limit. For now, I think it is acceptable to use the upper-bound strategy, given that we are making heavy use of the pool allocator. I agree with you that this is not optimal, and I am happy to revise the memory planning algorithm once we have a cleverer allocator.
Thanks for your response. To summarize, at a high level we both agree that fragmentation is an issue to be resolved, but we differ in the approaches we believe are effective and sustainable.
I believe the point you wanted to make is that relying on the pool allocator, as you pointed out repeatedly in this thread, could alleviate or address the issue of over-allocation with unknown lifetime. This point is valid for temporary intermediate buffers, but not when it comes to over-allocating return tensors. To help explain why, let me walk you through the example I gave in my previous response: as you may already tell, the over-allocation that happens in Line 3 is carried through into the vector `outputs`.
We both agree that memory fragmentation is a problem in general, and I believe we both want to take a stab at solving it without shooting ourselves and other developers in the foot in use cases in the very near future. And more broadly, static memory planning is a common pass used in every …
Thanks for the inputs! Your example is clear and demonstrates the weakness of upper-bound allocation. So far we have discussed three possible allocation strategies for the output tensor (S1, S2, S3), and two runtime use cases of the VM (C1 and C2).
General cases may involve more complicated uses of the output tensor that mix both cases above. Based on our discussion, we agree that …
Though a general compiler pass is not supposed to assume that the execution runtime follows a certain behavior, we believe the runtime behavior (when we clearly know it) can be helpful information for compilers. For example, in the use case of MLC LLM, we are sure that the output of VM functions (e.g., logits) will be released before the next invocation of the function. In this case, compiling the model with upper-bound allocation for output tensors is beneficial. In consideration of this, one approach is to introduce a compilation flag, in the form of a compile-time function attribute, to suggest “whether to allocate the output tensor statically with an upper-bound estimation.” For cases where we know the runtime use of output tensors (like in MLC LLM), we can enable this flag during model compilation, so that we can yield completely static runtime memory. This flag is disabled by default, and we will keep exact-size allocation for general cases where the output tensors may be used arbitrarily.
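The proposed opt-in flag can be sketched as follows. The attribute name comes from the companion PR; the planner function itself is a toy stand-in of mine, not the actual TVM pass:

```python
# Toy model of the proposed per-function attribute: when the flag is set
# (and an upper bound is known), the planner requests the static upper
# bound for the output; otherwise it keeps exact-size dynamic allocation.

ATTR = "relax.memory_plan_dynamic_func_output"

def plan_output_alloc(func_attrs, actual_size, upper_bound):
    """Return the storage size the planner would request for the output."""
    if func_attrs.get(ATTR, False) and upper_bound is not None:
        return upper_bound   # static: identical request on every call
    return actual_size       # default: exact dynamic size

# Serving-style function (e.g. MLC LLM): flag enabled at compile time.
print(plan_output_alloc({ATTR: True}, actual_size=1024, upper_bound=131072))  # 131072
# General function: flag absent, exact-size allocation is kept.
print(plan_output_alloc({}, actual_size=1024, upper_bound=131072))            # 1024
```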
Thanks for getting back to me @MasterJH5574! I believe C1 and C2 make our points crystal clear, where C1 is the case of over-allocated memory being kept for an indefinite life span, and C2 is the case of immediate memory recycling with the pooled allocator. Now moving to the discussion of S1, S2 and S3, where S1 and S2 are based on static analysis and S3 is purely runtime. The point I'd love to make here is that static analysis may not be sufficient once dynamism is involved, and it might eventually be desirable to have a hybrid approach instead. To give a specific example in LLMs:
If the problem is designed specifically for Llama2-7B only, it would be relatively easy to resolve, i.e. static analysis gives a function …
Within the scope of this PR, we don't have to find a perfect solution. Agreeing with your assessment, if the upstream framework can instruct the compiler to apply a certain upper-bound-based approach via function attributes/annotations, that should be sufficient for LLM serving so far.
This PR enhances the static block memory planning pass. Prior to this PR, memory planning only worked on memory allocations that are not externally referenced. In dynamic-shape settings, such memory allocation is not fully static and may lead to memory fragmentation. This PR enhances the behavior so that for such allocations we first allocate a storage according to its estimated upper bound (when known), and then allocate the tensor with the actual dynamic shape out of that storage. This ensures static memory allocation and avoids memory fragmentation.
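The two-step scheme can be modeled in a few lines of NumPy (a toy sketch; `alloc_storage` and `alloc_tensor` here only mimic the shape of Relax's memory builtins and are not the actual API):

```python
# Toy model of the planned allocation: one statically sized storage per
# output, with the dynamically shaped tensor viewed out of it at runtime.

import numpy as np

def alloc_storage(nbytes):
    """Allocate a flat byte storage of a statically known size."""
    return np.zeros(nbytes, dtype=np.uint8)

def alloc_tensor(storage, shape, dtype=np.float32):
    """View a tensor of the actual (dynamic) shape out of the storage."""
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    assert nbytes <= storage.nbytes, "actual shape must fit the upper bound"
    return storage[:nbytes].view(dtype).reshape(shape)

# Upper bound: shape (64, 128) float32 -> 32768 bytes, allocated once.
storage = alloc_storage(64 * 128 * 4)
# Actual dynamic shape at runtime, e.g. batch n = 5:
out = alloc_tensor(storage, (5, 128))
print(out.shape, storage.nbytes)   # (5, 128) 32768
```

Whatever the runtime shape turns out to be, the allocator only ever sees the single static storage request, which is the property the pass is after.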
This PR adds a pass into the model compilation pipeline which attaches an attribute `"relax.memory_plan_dynamic_func_output"` to each Relax function in the IRModule. This attribute suggests that the Relax functions' output tensors, though having dynamic shapes, are statically plannable. This enhancement makes sure that in serving scenarios our memory allocation is completely static after stabilizing, so we will not be worried about continually growing memory usage and can allocate more memory for the KV cache. This PR can be merged early, but it will not take effect until apache/tvm#16111 is merged.